of the “q2-vsearch” plugin. The input is the most recent preprocessed data artifact,
“demux-yoga-merged.qza”.
qiime vsearch dereplicate-sequences \
--i-sequences inputs/demux-yoga-merged.qza \
--o-dereplicated-table inputs/derep-yoga-table.qza \
--o-dereplicated-sequences inputs/derep-yoga-seqs.qza
The “dereplicate-sequences” command outputs two artifacts: (i) a feature table containing
the features (unique sequences) with their observed abundances (frequencies) in each of the
samples of the study and (ii) feature data in which each feature identifier is mapped to
its sequence.
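To make the idea of dereplication concrete, the following toy sketch collapses identical reads in a small FASTA input and counts how often each unique sequence occurs. It only illustrates the concept with standard Unix tools; it is not how vsearch or QIIME2 implement the step. The function name “derep_toy” and the three example reads are hypothetical.

```shell
# Toy illustration only: collapse identical reads and count them.
# This mimics the idea of dereplication, not the vsearch implementation.
derep_toy() {
  # Input (stdin): FASTA with exactly one sequence line per record.
  # Output: "<count><TAB><sequence>", most abundant sequence first.
  grep -v '^>' | sort | uniq -c | sort -k1,1nr -k2,2 | awk '{print $1 "\t" $2}'
}

# Three hypothetical reads, two of which are identical.
printf '>r1\nACGT\n>r2\nACGT\n>r3\nTTGA\n' | derep_toy
```

In the real workflow, the counts are kept per sample in the feature table, and the unique sequences are stored in the feature data artifact.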
7.3.4.2.1.3 Clustering Methods
Clustering follows dereplication. Both the feature table and the feature data artifacts
generated by dereplication are required as inputs for clustering. We will use QIIME2
to perform the three types of clustering (de novo, closed-reference, and open-reference
clustering).
7.3.4.2.1.3.1 De Novo Clustering
De novo clustering does not require a reference database; instead, it uses sequence
similarity to cluster features into groups. The similarity threshold can be set to 0.99
(only reads with at least 99% identity to the centroid sequence are allowed to join the
cluster). QIIME2 uses the “cluster-features-de-novo” method of the “q2-vsearch” plugin to
perform de novo clustering. The inputs are the feature table and feature data artifacts
generated by dereplication in the previous step. To keep the clustering results in a
separate directory, we will create the “denovo” subdirectory.
mkdir denovo
qiime vsearch cluster-features-de-novo \
--i-table inputs/derep-yoga-table.qza \
--i-sequences inputs/derep-yoga-seqs.qza \
--p-perc-identity 0.99 \
--o-clustered-table denovo/table-yoga-denovo.qza \
--o-clustered-sequences denovo/rep-seqs-yoga-denovo.qza
The outputs are two artifacts: a feature table for the OTUs and feature data containing
the centroid sequences that define each OTU cluster. De novo clustering usually consumes
more computational resources than the other two methods.
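The role of the identity threshold can be illustrated with a toy sketch of greedy centroid clustering. It makes simplifying assumptions that do not hold for vsearch: the sequences all have the same length, and identity is computed as the fraction of matching positions without any alignment. The function name “cluster_toy” and the example sequences are hypothetical. Note that the threshold is passed as a fraction (0.9 means 90%).

```shell
# Toy sketch: greedy centroid clustering on equal-length sequences.
# Assumption: identity = matching positions / length, with no alignment;
# vsearch aligns sequences, so real results can differ.
cluster_toy() {
  # $1    = identity threshold as a fraction (e.g. 0.99, not 0.99%)
  # stdin = "<count><TAB><sequence>" lines, most abundant first
  awk -F'\t' -v thr="$1" '
    function ident(a, b,    i, n, m) {
      n = length(a); m = 0
      if (n == 0 || n != length(b)) return 0
      for (i = 1; i <= n; i++) if (substr(a, i, 1) == substr(b, i, 1)) m++
      return m / n
    }
    {
      hit = 0
      # Join the first existing centroid that is similar enough.
      for (c = 1; c <= nc; c++) {
        if (ident($2, cent[c]) >= thr) { size[c] += $1; hit = 1; break }
      }
      # Otherwise this sequence becomes the centroid of a new cluster.
      if (!hit) { nc++; cent[nc] = $2; size[nc] = $1 }
    }
    END { for (c = 1; c <= nc; c++) print "OTU" c "\t" size[c] "\t" cent[c] }
  '
}

# Hypothetical dereplicated input: the second sequence differs from the
# first at one of ten positions (90% identity); the third is unrelated.
printf '5\tAAAAAAAAAA\n2\tAAAAAAAAAT\n1\tCCCCCCCCCC\n' | cluster_toy 0.9
```

With a threshold of 0.9, the first two sequences fall into one cluster with a combined abundance of 7, and the third forms its own cluster. Raising the threshold to 0.99 would leave all three sequences in separate clusters.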
7.3.4.2.1.3.2 Closed-Reference Clustering
Closed-reference clustering requires a curated database of 16S rRNA gene sequences to
serve as reference sequences. Only the representative sequences that have matches in the
database are clustered, while those without matches are discarded. Examples of widely
used databases include Greengenes (16S rRNA) at “https://greengenes.secondgenome.